Introduction

library(ggplot2)
library(dplyr)
library(ggmap)
library(leaflet)
library(leaflet.extras)
library(memisc)

Introduction

The existence of Citi Bike is definitely one reason that makes New York City (NYC) unique among other cities in United States. It is a bicycle sharing system opened in May 2013 in NYC with 332 stations. Until now, there are 750 stations and 12,000 bikes serving Manhattan, Queens and Brooklyn (also including New Jersey, but for analysis purpose of this project, we only investigate Citi Bike data in NYC).

However, we usually see way more people using Citi Bike around Midtown than Columbia University (CU). Is this because the limited number of stations or available bikes around CU or people around CU do not prefer to use sharing bicycles? We are interested in the spatial distribution of stations and number of users. We will use R to analyze Citi Bike data in one particular month in order to understand the transportation of sharing bike and other interesting questions: * Whether weather and season have an impact on Citi Bike frequency * Whether there is any trip pattern per week.

Team members and corresponding contributions:

Description of data

Our data is from Motivate. Motivate is the most experienced bike share team.They launched CitiBike in 2013 in NYC.
Besides, they collect System data as CSV files, which can be downloaded from the link above.

In this final project, we analyze the data from June, 2018 because we think it is a common time period when people prefer to ride bikes under relatively mild weather, which can be very representative.

We do realize that season and time will have an impact on people’s choice on riding bikes, so in our analysis, we will include other data file such as weather information from other sources in order to analyze the time and seasons. Besides, the website provides data from June 2013 so we can analyze the time series pattern as well. When it comes to time series, we downloaded data from the source from 2017 June to 2018 June, and sampled each dataset with randomly choosing 5% of the data. Then we concat them together. The code combine the dataset is written in python and is also included in the github repo named sampledata.py.

The features in the dataframe includes:

Analysis of data quality

Load Data

df <- read.csv('./201806-citibike-tripdata.csv')

Definition on outliers

Evaluation on the dataset

Overall, data quality of our dataset is pretty good. The outlier we defined, which is time duration over 120 minutes and stations out of NYC, only accounts for 0.4% data in our dataset. From this point of view, we can drop out those rows.

Main analysis (Exploratory Data Analysis)

1. Analysis on Age distribution of our customers

df_2month = full_join(summary_18_6, summary_17_6, by = "factor_year")
df_3month = full_join(df_2month, summary_16_6, by = "factor_year")

t1 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.x, time = df_3month$time.x)
t2 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.y, time = df_3month$time.y)
t3 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq, time = df_3month$time)

tf = rbind(t1,t2,t3)

ggplot(tf, aes(x = factor_year, y = Freq, color = as.factor(time))) + geom_point() +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in different months")+
  labs(fill="Time") + theme_grey(16)

This Cleveland Dot Plot shows the distribution of user’s birth year in 2016/6, 2017/6 and 2018/6. They are all left-skewed, which is what we expect. It is glad to see number of users for all ages increases from 2016 to 2018. In 2018 June, the active users are aged between 24 and 37.

set.seed(321)
index = sample(nrow(df), round(nrow(df)*0.001))
sample = df[index,]
g = ggplot(sample, aes(birth.year,minute))
g + geom_point(alpha = 0.5, col = "#fb6a4a") + ylab("trip duration (in min)") +
  xlab("users' birth year") + ggtitle("Scatterplot of uses' birth year and trip duration (in sec) in Jun 2018") + theme_grey(16)

There are 1.945,611 observations in our data set, so we sampled 1% of them to see the relationship between user’s birth of year and trip durations.
We were expect to see an increasing and then decreasing trend as in the distribution of users’ birth year (right-skewed). However, there is no obvious trend in the scatterplot. Again, since NA birth year were transfered to 1969, there are a lot observations in 1969 and its covers a wide range of trip durations, which is misleading.

2. Analysis on Gender of our customers

gender = data.frame(group = c("unknown", "male", "female"),
                    value = summary(as.factor(df$gender)))
pie = ggplot(gender, aes(x="", y = value, fill=group)) +
  geom_bar(width = 1, stat="identity") + 
  coord_polar("y", start = 0)

blank_theme <- theme_minimal()+
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid=element_blank(),
  axis.ticks = element_blank(),
  plot.title=element_text(size=14, face="bold")
  )

pie + scale_fill_manual(values=c("#fbb4ae", "#b3cde3", "#ccebc5")) + 
  blank_theme + theme(axis.text.x=element_blank()) +
  geom_text(aes(y = value/3 + c(0, cumsum(value)[-length(value)]), 
                label = c("10%","66%", "24%")), size=5) + 
  ggtitle("Ratio of users' gender of Citi Bike ") + theme_grey(16)

There are 66% male users, 24% female users and 10% unkown. Apprently, there is a big differnece betwee female users and male users. We have seen some news report that Citi Bike users sometimes complain about the weight of citibike. Maybe that is one of the reasons that female users are small. Thus, we can think of some strategies for female users to increase profits because there is indeed room for improvement.

3. Analysis on Station Distribution and Corresponding Number of Bikes

Below is the heatmap of number of trips for each station. We also label the top 10 popular stations

## get station info 
station.info <- data_18_6 %>%
  group_by(start.station.id) %>% 
  summarise(lat=as.numeric(start.station.latitude[1]),
            long=as.numeric(start.station.longitude[1]),
            name=start.station.name[1],
            n.trips=n())

pal = colorNumeric(
  palette="YlOrRd",
  domain = station.info$n.trips
)

top10 = station.info %>% dplyr::arrange(desc(n.trips))


icon <- makeIcon(
  iconUrl = ".img/citi_bike_icon.png",
  iconWidth = 20, iconHeight = 20,
  iconAnchorX = 22, iconAnchorY = 22,
  shadowWidth = 20, shadowHeight = 20,
  shadowAnchorX = 4, shadowAnchorY = 62
)

leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
  leaflet::addProviderTiles("CartoDB.Positron") %>%
  leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
  leaflet::addLegend("bottomleft", pal = pal, value = ~n.trips, title = "Num of trips starting from this station") %>%
  addMarkers(lng = ~top10[1:10,]$long, lat = ~top10[1:10,]$lat,  icon = icon)

We can see that there are more users starting their trips in Manhattan. In Manhattan, Midtown is the most popular area. The station near Central Park is also a popular spot, probably because there are some tourists riding inside Central Park. Back to our first question in the introduction, we are wondering why we do not see a lot of people around campus riding with Citi Bike. The number of stations around campus is not limited, but the starting rides from those stops are small. One possible explanation we believe is that there are quite amount of students living in dorms or very close to campus. So they do not need bicycles to transport.

4. Analysis on weather and season

Join weather data to see if weather has an impact

Intuitively, It is reasonable to assume that weather and season has a huge impact on CitiBike frequency.

NYCweather<-read.csv("./data/NYCweather.csv")
df$Date <- as.Date(df$starttime)
NYCweather$Date<-as.Date(NYCweather$Date)
df_weather <- inner_join(df,NYCweather,by="Date")
df_weather$Severe <- as.factor(df_weather$Severe)

ggplot(df_weather, aes(x=Date))+
  geom_bar(position='dodge', colour="white", fill="#fb6a4a")+
  ggtitle("Number of trips by day") + theme_grey(16)

We made a join of Citi Bike trip data and NYC Weather data of June 2018 to see if weather has an impact on trips. In the histogram below, we showed the number of trips corresponding to each day in June 2018. As we can observe, June-26 has abnormally more trips than other days. One explanation is that maybe missing data is automatically imputed as June01, or maybe there is some special events or promotions on that day, which caused an increase in the number of trips.

In the NYC Weather dataset, a severe weather is defined as either rainy, foggy or snowy. We compared the number of trips corresponding to severe and unsevere weathers, and we can see a difference here. Notice that the number of days with severe weather and days with unsevere weather are both 15, so this comparison can reflect the difference between these two levels.

ggplot(df_weather, aes(x=Severe))+
  geom_bar(colour="white", fill="#fb6a4a")+
  ggtitle("Number of trips per day for severe and unsevere weather condition") +theme_grey(16)

We then visualized the average number of trips per day in different temperatures, windspeeds, and precipitation. These visualizations are limited by the precision of NYC Weather dataset – only average windspeed, temperature and precipitation each day is available, and since only 30 days of data is included, we fail to observe a significant pattern here. However, there are peaks for each metric - the number of trips peaks at tempeature around 72, windspeed around 3.8 m/s, and precipitation of 0.

df_by_temp <- df_weather %>% group_by(Temperature) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_temp, aes(x=as.numeric(Temperature), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="#fb6a4a")+
  xlab("Temperature")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different temperatures") +theme_grey(16)

df_by_windspeed <- df_weather %>% group_by(Windspeed) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_windspeed, aes(x=as.numeric(Windspeed), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="#fb6a4a")+
  xlab("Windspeed")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different windspeeds")+theme_grey(16)

df_by_precip <- df_weather %>% group_by(Precipitation) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_precip, aes(x=as.numeric(Precipitation), y=`n_distinct(starttime)`))+
  stat_summary(geom="bar", fill="#fb6a4a")+
  xlab("Precipitation")+
  ylab("Average trips per day")+
  ggtitle("Average daily trips in different precipitations")+theme_grey(16)

We also compare the average trip duration in severe and unsevere weather, and we can see the difference is minor compared to the difference in the number of trips.

ggplot(data = df_weather,
       aes(x=factor(Severe), y=minute,fill= '#fb6a4a'))+
  xlab("Severity of weather")+ylab("Average trip duration")+ 
  stat_summary(fun.y="mean", geom="bar")+
  ggtitle("Average trip duration for severe and unsevere weather condition") + theme_grey(16)+ theme(legend.position="none")

The next 3 graphs show the relationship between the average trip durations in different windspeed, temperature and precipitation. Due to the scarcity of weather data, we fail to observe a significant pattern in temperature and windspeed, but we can observe the average trip duration drops with precipitation.

ggplot(data = df_weather,
       aes(x = Windspeed, y = minute,color='#fb6a4a')) +
  scale_x_continuous("Windspeed") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Windspeed") + 
  stat_summary(fun.y="mean", geom = "line", size=1)+theme_grey(16)+ theme(legend.position="none") 

ggplot(data = df_weather,
       aes(x = Temperature, y = minute,color='#fb6a4a')) +
  scale_x_continuous("Temperature") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Temperature") + 
  stat_summary(fun.y="mean", geom="line", size=1)+ theme_grey(16)+ theme(legend.position="none") 

ggplot(data = df_weather,
       aes(x = Precipitation, y = minute,color='#fb6a4a')) +
  scale_x_continuous("Precipitation") +
  scale_y_continuous("Average trip duration in minutes") +
  ggtitle("Trip duration vs. Precipitation") +
  stat_summary(fun.y="mean", geom = "line", size=1)+ theme_grey(16)+ theme(legend.position="none")

Interactive time series to see if season has an impact on customer flow with Dygraph

We combine over a year time series to see if season has an impact on customer flow with Dygraph. The data is from 201706 to 201806.

library(dygraphs)
library(xts)
library(lubridate)
library(zoo)
df_timeseries <- read.csv('./data/data_2017_2018.csv')
a <- lubridate::mdy_hms(df_timeseries$starttime)
b <- as.Date(df_timeseries$starttime)
b[is.na(b)] <- a[!is.na(a)]
df_timeseries$starttime <- b
df_timeseries <- df_timeseries %>% mutate(time = format(as.Date(starttime), "%Y-%m"))
df_timeseries <- df_timeseries[,c("time","usertype")]

df_subscriber <- df_timeseries %>% subset(usertype == 'Subscriber')
df_customer <- df_timeseries %>% subset(usertype == 'Customer')
df_subscriber <- df_subscriber %>% group_by(time) %>% summarise(Number = n())
df_customer <- df_customer %>% group_by(time) %>% summarise(Number = n())

ts_subscriber <- xts::xts(df_subscriber$Number,as.Date(as.yearmon(df_subscriber$time)))
ts_customer <- xts::xts(df_customer$Number,as.Date(as.yearmon(df_customer$time)))
ts <- cbind(ts_subscriber,ts_customer)
dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber ') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))

From the graph above, we can observe a huge drop during cold weather like Januray. We can infer that season factor indeed has a huge impact on frequency.

5. Analysis about customer flow

library(lubridate) 
df_timeslot <- mutate(df,hour = hour(as.POSIXct(starttime)))
df_timeslot <- mutate(df_timeslot,wday = wday(as.POSIXct(starttime)))
df_timeslot <- df_timeslot[,c("hour","wday","minute","start.station.name","usertype")]
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by usertype')
timeslot + facet_wrap(~usertype) + xlab('Hours')+ylab('Frequency')+theme_grey(16)

From the above analysis, the overall time slot has two peaks: in 8-9 am and 5-6 pm, which is reasonable since they are the rush hour, which has a large number of passenger flow.
However, the time slot distribution differs a lot with aspect to different usertype and different weekdays.Take the above images as examples, the customer has a higher peak in the time slot which is around 4 pm while the subscribers has two peaks.

timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour)))  +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by weekdays')
timeslot + facet_wrap(~as.factor(wday)) +xlab('Hours')+ylab('Frequency')+ theme_grey(16)

Also, the weekdays play a big role in the time duration distribution. On weekdays, we still sees similar two peaks. However, on weekends, we can only observe one peak. Thus, we can think some specific pricing strategy for a specific time slot in the weekdays.

Executive summary (Presentation-style)

Our team ran an explanatory data analysis and visualization on the Citi Bike dataset in the June of 2018. In terms of trip length, we visualized the distribution of time duration as follows:

ggplot(data=df,mapping=aes(minute)) + geom_histogram(fill='#fb6a4a',color='black')+ggtitle('Distribution of time duration') +xlab('minutes over one trip') + ylab('Frequency of trips')+ theme_grey(16)

Some interesting discoveies are listed as follows.

1. Abnormal peak in birth year 1969

In visualizing the distribution of Citi Bike users’ birth year, we observed that the birth year reaches its peak at 1988. So the typical users may be in the age around 30 years old. However, an abnormality is also observed as illustrated in the following graph(the lower part of graph is omitted here). There exists a sudden peak in the year 1969.We looked a little more deeply in the data and observed that after 2018, there is this abnormal peak, while there is no such peak in years before 2018. Therefore, we made a hypothesis: when imputing missing values after 2018, Citi Bike system set the value to 1969, possibly because it was the typical user value before. To verify this hypothesis, we could contact the Citi Bike time and asked about how they treat missing values in their user data.

ggplot(summary_18_6, aes(x = factor_year, y = Freq)) + geom_point(col = "#fb6a4a") +
  coord_flip() + xlab("user's birth year") + 
  ggtitle("Frequency of Citi Bike users' birth year in 2018/6") + theme_grey(16)

2. Spatial Distribution of bike stations

We also made a spatial analysis of how popular each Citi Bike station is in terms of the number of trips in the whole month. In the following map, the intensity of color represents the popularity of stations. If the Citi Bike time intends to vary the number of bikes according to the frequency of usage, this map can be used as a reference. We can observe a clustering of popular Citi Bike stations in midtown close to the 5th Avenue. This observation conforms to our intuition because the 5th Avenue generally has more population.

leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
  leaflet::addProviderTiles("CartoDB.Positron") %>%
  leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
  leaflet::addLegend("bottomleft", pal = pal, value = ~n.trips, title = "Num of trips starting from this station") %>%
  addMarkers(lng = ~top10[1:10,]$long, lat = ~top10[1:10,]$lat,  icon = icon)

### 3. Weather and season impact on trip frequency

From our analysis, weather has an impact on trip frequency descrease, and season does a huge impact. We compared the average number of trips per day in severe and unsevere weather conditions, and observed that in severe weather (days that are rainy or foggy), the number of trips per day decreases.

ggplot(df_weather, aes(x=Severe))+
  geom_bar(colour="white", fill="#fb6a4a")+ggtitle("Number of trips per day for severe and unsevere weather condition") +theme_grey(16) + ylab('Frequency')

dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber ') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))

4. Rush hour in trip duration

When we were analyzing the trip data in time of day, we found that customers and subscribes are most different in that customers only have one peak hour around 3-5 PM, but subscribers have two peaks at one day: one around 8-9 AM and the other around 5-6 PM. Based on this observation, our hypothesis is that customers are mostly tourists or casual bike riders, and subscribers use Citi Bike to commute. This hypothesis follows the pattern we observed, and to verify it, we need more detailed dataset and users’ information.

timeslot1 <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by usertype')
timeslot1 + facet_wrap(~usertype) +xlab('Hours')+ylab('Frequency')+theme_grey(16)

We also notice in weekend, people tend to have only one rush hour, the pattern is very different from weekdays.

timeslot2 <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour)))  +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by weekdays')
timeslot2 + facet_wrap(~as.factor(wday)) +xlab('Hours')+ylab('Frequency')+ theme_grey(16)

Interactive component

Our data interactive component is written with Shiny and published out at shinyapps.io. Here is the Link

Design Purpose

Our interactive part mainly focuses on the geo-based data analysis. In other words, given specific start (and end ) stations of citibikes, we aim to find some bike using patterns behind and try to understand and explain them in terms of users’ perspectives. In addition, we also provide some business advice based on the interactive analysis results, such as how to allocate the citybikes to different stations to satisfy people’s needs from different areas.

Because there are a lot of stations in New York (765 in the datasets we use in the interactive part), the popularity of different stations may vary a lot from one to anther. For decision makers, it would be necessary to know the users demands and prefences to different bike stations and thus making more resonable station construction and bike allocation plans. For users, it might be useful to know the general aeverage riding time from start stations A to end station B. Also, given the start station, we hope to find the top 5~30 frequent end stations users tend to return the bikes to from the given start station, which would also help the decision makers to do some research on the different types ouser habits. I’ll offer you some basic informations about this shiny app and show you how to use the interactive app to to data analysis as follows.

Interactive Features and How to use

The dataset of the interactive part is from July 2018. We mark all start staions of the dataset on the map, which will exhibit different clusters based on the zoom scales. Users could see a specific station on the map by zooming in or clicking the cluster to yield smaller clusters and individual stations. The staion’s name will show up when mouseovering a specific individual station.

First of all, to see the usage statistics of different stations, you could simply click the station marker, then the bar plot will pop out in the upper right corner. The bar plot will change as you click different stations. So based on this interactive function, it would be convient for the business managers to see the statistics of different bike staions and the change over time in a day. Just play around the “Click” function to see different stations, you will find that people uaually use citibikes during 5pm to 7pm. For some areas, there is another “usage peak” during 7~9 am. Of course, the demands for citibikes vary a lot from one to another, It might be necessary to construct some new bike stations according to the demands of different areas and the demands of different time in a day.

Moreover, by selecting or searching a specific “Start Stations” in the select box at the bottom of the web page, users could see the corresponding bar plot of top 10 “End Stations” from the selected “Start Stations”. It’s also convient to change the range of top popular “End Stations” by dragging slider bar “Top Popular Stations” at the bottom left. For users, it’s quite straightforward to see the “Top 5~30 Destinations” based on their current locations, which may be useful for them to choose the end stations to return the bikes. As you can see, this bar plot displays the statistics of two usertypes, i.e “Customer” and “Subscriber”. Why we want to see the statistics of the different user types? Our team suppose the number of “Subscriber” should be larger than the number of “Customer” at most stations because it’s more convenient and less expensive to use citibikes as “Subscriber”, which is the truth for most stations. However, we found an interesting “counter-example” (“abnormal station”), “Central Park S & 6 Ave”. If you select “Central Park S & 6 Ave” as the “Start Station”, it will be suprised to find that the number of “Customer” is much larger than those of “Subscriber”! Our team thinks it might be attributed to a lot of tourists around the Central Park and tourists may just tend to use citibikes once. This “abnormality” shows that it might be necessary to do more detailed data analysis based on different bike stations and make different bike aollcation plans and business strategies towards different types of people according to their demands.

Last but not least, our interactive web page alsp provides users with a boxplot to see the time duration (time cost) distribution from a specified start stations to another specified end stations. Users could select the “Start Stations” and “End Stations” to see the time cost based on different genders (Unknown, Male, Female). For citibike users, they could have a roughly idea about how long it would take from their near stations to the destinations. For business managers, it might be useful to make assumptions about why people use citibikes from station A to station B based on the time cost of the trip. Again, we use “Central Park S & 6 Ave” as example. If we select it as “Start Stations” and “End Stations”, we would see three “standard” boxplots of different genders, the distributions of the three boxplots are also pretty similar to each other. Our team still thinks it is beacause there are a lot of tourists and most people use citibikes start from “Central Park S & 6 Ave” may also return them to the start positions.

Summary & Future Plan

Our interactive part mainly focuses on the geo-based data visulizayion analysis. Users could see some visulized results through some easy interactions with our shiny website, such as click, drag, select and search.

There are two functions we’ve tried so hard to add to the interactive shiny app but didn’t make it in the end. They are “Localize Me” and “Search a Place and See the Nearby Stations”. The first one is “Localize Me”, it is a button which could help users quickly localize their positions on the map and see the nearby stations and the corresponding data visualized analysis. We actually achieved this function using JavaScript on our local shiny app, but failed to deloy it at shinyapps.io. The other is “Search a Place and See the Nearby Stations”, it would help people quickly find the bike stations near the place they are interested at, which would speed up the interactive data analysis process quickly compared to search certain bike stations directly. (It’s much easier to remember the specific place’s name than the station’s location).

Conclusion

After doing the analysis of Citi Bike data, we are surprised that there are total 1.95 million ridings in June 2018. It is glad to see more people choose this healthy mode of transport. We have found some results: The active users in June 2018 is between 24 and 37 years old; More than three fifth of users are female; In June 2018, Popular stations are in Midtown in manhattan. (other results…)

We were able to find patterns of typical user groups, typical trips and spatial information from the Citi Bike dataset. These analyses are most useful if the Citi Bike group would like to set a more personalized version of pricing or advertising strategy for Citi Bike than the current customer/subscriber split.

In the future, we could conduct analysis on the correlation between trips and weather if more detailed weather information is available. We could also make a more detailed analysis focusing on the popular stations – what aspects of trips starting or ending at those stations are different than trips that are not? What adjustments can we make in order to better utilize these information?